flowchart LR x1[📜 CSV file] x2[🧹 Cleanup columns] x3[💾 Save CSV file] x4[🎲 Preview random 10 rows] x5[🕵 Look for coinciding text values] x6[🧮 Aggregate data] x7[📊 Visualise data] x1 --> x2 x2 --> x4 x2 --> x5 x2 --> x6 x6 --> x7 x2 --> x3
😊 Hello Thursday
Introduction
Hello world something something
Overview
Step 1. Load the libraries
Code
library(tidyverse)
library(plotly)Step 2. Open the CSV file
Please note that in the code, we carefully identify each of the data types.
Code
cso_dha79 <- read_csv("../../datasets/DHA79.20250129T140111.csv", col_types = list (
`STATISTIC` = col_character(),
`Statistic Label` = col_character(),
`TLIST(A1)` = col_integer(),
`Year` = col_integer(),
`C02076V02508` = col_character(),
`Age Group` = col_character(),
`C02199V02655` = col_character(),
`Sex` = col_character(),
`C02855V03433` = col_character(),
`ISHMT` = col_character(),
`UNIT` = col_character(),
`VALUE` = col_integer()
))
print(cso_dha79)# A tibble: 70,980 × 12
STATISTIC `Statistic Label` `TLIST(A1)` Year C02076V02508 `Age Group`
<chr> <chr> <int> <int> <chr> <chr>
1 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
2 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
3 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
4 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
5 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
6 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
7 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
8 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
9 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
10 DHA79C01 Inpatients 2017 2017 215 0 - 14 years
# ℹ 70,970 more rows
# ℹ 6 more variables: C02199V02655 <chr>, Sex <chr>, C02855V03433 <chr>,
# ISHMT <chr>, UNIT <chr>, VALUE <int>
Step 3. Clean up the dataset
In this step, we:
- Rename the columns, so they are easier to work with programmatically;
- Select only the columns we need for the purpose of this analysis;
- Handle NA values
Code
# Rename the columns, so they are easier to work with programmatically
cso_dha79 <- cso_dha79 %>%
rename(px_type = "Statistic Label") %>%
rename(calendar_year = "Year") %>%
rename(age_group = "Age Group") %>%
rename(medical_sex = "Sex") %>%
rename(medical_ishmt = "ISHMT") %>%
rename(num_cases = "VALUE")
# Select only the columns we need for the purpose of this analysis
cso_dha79_selectedcols <- cso_dha79 %>% select(px_type, calendar_year, age_group, medical_ishmt, medical_sex, num_cases)
# Handle NA values
cso_dha79_selectedcols <- cso_dha79_selectedcols %>% mutate(num_cases = replace_na(num_cases, 0))Step 4. Handle aggregations in the datset
The dataset contains some rows which aggregate the data of other rows (!):
- Where
medical_ishmt= “All causes (A00-Z99 (excl. V, W, X & Y ))” - Where
medical_ishmt= “All causes excluding dialysis daycases (A00-Z99 (excl. Z49.1 & Z49.2, V, W, X & Y)) ))” - Where
medical_sex= “Both sexes” - Where
age_group= “All ages”
These will mess up our own aggregation process, so we will need to handle them.
Note: ISHMT = International shortlist for hospital morbidity tabulation (ISHMT). For more info see this link here.
Code
cso_dha79_selectedcols <- cso_dha79_selectedcols %>%
filter(medical_ishmt != "All causes (A00-Z99 (excl. V, W, X & Y ))") %>%
filter(medical_ishmt != "All causes excluding dialysis daycases (A00-Z99 (excl. Z49.1 & Z49.2, V, W, X & Y)) ))") %>%
filter(medical_sex != "Both sexes") %>%
filter(age_group != "All ages")Step 5. Plot the data using Plotly
gt?
Have a look at the part that starts with library(gt). This is an alternative to print(data_for_plotly).
Prepare the data for plotly:
Code
# Data for plotly
data_for_plotly <- cso_dha79_selectedcols %>%
group_by(age_group, calendar_year) %>%
summarize(
total_num_cases = sum(num_cases),
.groups = "keep"
) %>%
ungroup()
library(gt)
data_for_plotly %>%
gt() %>%
fmt_number(
columns = total_num_cases,
sep_mark = ",",
decimals = 0
) %>%
opt_table_font(
size = "9pt",
font = list(
"Courier",
"monospace"
)
)| age_group | calendar_year | total_num_cases |
|---|---|---|
| 0 - 14 years | 2017 | 406,852 |
| 0 - 14 years | 2018 | 414,298 |
| 0 - 14 years | 2019 | 399,405 |
| 0 - 14 years | 2020 | 299,725 |
| 0 - 14 years | 2021 | 326,061 |
| 0 - 14 years | 2022 | 364,675 |
| 0 - 14 years | 2023 | 366,342 |
| 15 - 24 years | 2017 | 271,048 |
| 15 - 24 years | 2018 | 278,985 |
| 15 - 24 years | 2019 | 276,314 |
| 15 - 24 years | 2020 | 228,310 |
| 15 - 24 years | 2021 | 250,518 |
| 15 - 24 years | 2022 | 266,733 |
| 15 - 24 years | 2023 | 278,088 |
| 25 - 34 years | 2017 | 558,525 |
| 25 - 34 years | 2018 | 533,437 |
| 25 - 34 years | 2019 | 527,279 |
| 25 - 34 years | 2020 | 451,045 |
| 25 - 34 years | 2021 | 491,330 |
| 25 - 34 years | 2022 | 474,808 |
| 25 - 34 years | 2023 | 502,524 |
| 35 - 44 years | 2017 | 682,232 |
| 35 - 44 years | 2018 | 671,191 |
| 35 - 44 years | 2019 | 684,879 |
| 35 - 44 years | 2020 | 590,215 |
| 35 - 44 years | 2021 | 643,042 |
| 35 - 44 years | 2022 | 651,111 |
| 35 - 44 years | 2023 | 695,418 |
| 45 - 54 years | 2017 | 714,570 |
| 45 - 54 years | 2018 | 717,052 |
| 45 - 54 years | 2019 | 728,718 |
| 45 - 54 years | 2020 | 620,874 |
| 45 - 54 years | 2021 | 672,493 |
| 45 - 54 years | 2022 | 719,898 |
| 45 - 54 years | 2023 | 789,992 |
| 55 - 64 years | 2017 | 912,387 |
| 55 - 64 years | 2018 | 918,492 |
| 55 - 64 years | 2019 | 950,748 |
| 55 - 64 years | 2020 | 807,007 |
| 55 - 64 years | 2021 | 860,533 |
| 55 - 64 years | 2022 | 927,146 |
| 55 - 64 years | 2023 | 1,008,080 |
| 65 - 74 years | 2017 | 1,083,371 |
| 65 - 74 years | 2018 | 1,108,931 |
| 65 - 74 years | 2019 | 1,137,802 |
| 65 - 74 years | 2020 | 976,497 |
| 65 - 74 years | 2021 | 1,033,032 |
| 65 - 74 years | 2022 | 1,109,468 |
| 65 - 74 years | 2023 | 1,177,083 |
| 75 - 84 years | 2017 | 757,974 |
| 75 - 84 years | 2018 | 788,348 |
| 75 - 84 years | 2019 | 824,872 |
| 75 - 84 years | 2020 | 714,030 |
| 75 - 84 years | 2021 | 783,437 |
| 75 - 84 years | 2022 | 879,043 |
| 75 - 84 years | 2023 | 952,249 |
| 85 years and over | 2017 | 234,907 |
| 85 years and over | 2018 | 252,414 |
| 85 years and over | 2019 | 264,749 |
| 85 years and over | 2020 | 228,528 |
| 85 years and over | 2021 | 261,273 |
| 85 years and over | 2022 | 286,837 |
| 85 years and over | 2023 | 306,589 |
Code
# print(data_for_plotly)Show the plot:
Code
plot <- data_for_plotly %>%
plot_ly(
x = ~total_num_cases,
type = "box",
name = ~age_group
)
plot